This report was provided in R markdown (format:html).

Q1 - What financial topics do consumers discuss on social media and what caused the consumers to post about this topic?

Clustering the model-generated topics (the clusters appear below as large blue bars) reveals four categories.

They are:

  1. News & marketing
  2. Complaints about fees associated with banking
  3. Banking customer service complaints
  4. Overall dissatisfaction with banks

The most frequently occurring topic in the data was stock trading news. The second most frequent was a marketing campaign organized around the hashtag #getcollegeready. In aggregate, the customer-service-related topics (clusters 2, 3, and 4, referred to in the rest of this report as Hot Topics) accounted for approximately 15-20% of the data.

Click anywhere on the object below to collapse or expand sections or clusters, or to see more information such as high-probability words and representative sample documents.

## Performing hierarchical topic clustering ... 
## Generating JSON representation of the model ...
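
The grouping of topics into clusters can be illustrated with a small Python sketch. The report does not state which clustering algorithm or distance measure was used, so the agglomerative average-linkage approach, the Hellinger distance, and the topic-word probabilities below are all assumptions made for illustration.

```python
from math import sqrt

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

def average_linkage(c1, c2, topics):
    """Mean pairwise distance between the topics in two clusters."""
    dists = [hellinger(topics[i], topics[j]) for i in c1 for j in c2]
    return sum(dists) / len(dists)

def cluster_topics(topics, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(topics))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], topics)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical topic-word probabilities over a 4-word vocabulary
# ("stock", "trade", "fee", "overdraft"); not taken from the report's model.
topics = [
    [0.60, 0.30, 0.05, 0.05],
    [0.50, 0.40, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.05, 0.05, 0.40, 0.50],
]
clusters = cluster_topics(topics, n_clusters=2)
print(clusters)
```

In this toy input the two trading-like topics and the two fee-like topics end up in separate clusters, which mirrors how related topics are grouped under one category in the report.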

Deliverable A - Describe your Approach and Methodology. Include a visual representation of your analytic process flow.

To address the questions raised by the challenge, multiple topic models were generated with different algorithms from data that had been filtered to include only relevant banks, cleaned, and then processed. The topic models were evaluated for usefulness, and one model was selected. Visualizations were created to interpret the model's output and to explore the identified topics in detail. Once the customer-related topics were identified, the analysis focused on them: the bank-name-related sentiment of the records within each topic was calculated, correlations between bank names and frequent words were calculated, and the distribution of bank names across Hot Topics was explored.
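
The filter, clean, and process steps described above can be sketched in Python as follows. This is a minimal illustration, not the report's actual pipeline: the bank names, the stop-word list, and the cleaning rules are hypothetical, since the report does not specify them.

```python
import re

# Hypothetical bank names; the report does not list the banks it filtered on.
RELEVANT_BANKS = {"banka", "bankb"}
# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "my", "to", "and", "of", "for", "on"}

def mentions_relevant_bank(text):
    """Filter step: keep a post only if it names one of the relevant banks."""
    lowered = text.lower()
    return any(bank in lowered for bank in RELEVANT_BANKS)

def clean(text):
    """Clean step: lowercase, drop URLs, keep letters, '#', and '@' only."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z#@\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Process step: split the cleaned text and remove stop words."""
    return [t for t in clean(text).split() if t not in STOP_WORDS]

def prepare_corpus(posts):
    """Filter first, then clean and tokenize, in the order the report lists."""
    relevant = [p for p in posts if mentions_relevant_bank(p)]
    return [tokenize(p) for p in relevant]

posts = [
    "BankA charged me a $35 overdraft fee! http://example.com/x",
    "Lovely weather today",
    "Waiting on hold with BankB for an hour",
]
corpus = prepare_corpus(posts)
print(corpus)
```

The resulting token lists are the kind of input a topic model would be fitted on; the model fitting and selection steps are not shown here.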

Deliverable B - Discuss the data and its relationship to social conversation drivers.

To analyze the relationship between the content of posts and what was driving the conversations, a sentiment analysis was performed. First, a polarity score was calculated for all of the documents in the corpus used by our model. Then, to better understand the positive or negative relationship between posters and certain topics, individual topics were scored for sentiment as well. This makes it possible to compare the sentiment, represented by the Standardized Polarity, of All Topics with that of the Hot Topics. Topics 3 and 19 (related to customer service/experience) carry the most negative sentiment relative to All Topics.

##            Standardized Polarity
## All Topics            0.09322253
## Topic 2              -0.20194620
## Topic 3              -0.55033973
## Topic 6              -0.38922330
## Topic 12             -0.01211667
## Topic 19             -0.45077812
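
To make the comparison in the table concrete, here is a small Python sketch of how a standardized polarity could be computed. The report does not define the score, so this assumes one common definition (mean document polarity divided by its standard deviation) and a simplified lexicon-based polarity; the word lists and documents are invented for illustration.

```python
from statistics import mean, stdev

# Toy lexicon for illustration only; not the lexicon used in the report.
POSITIVE = {"great", "love", "helpful", "thanks"}
NEGATIVE = {"fee", "worst", "hold", "rude", "terrible"}

def polarity(tokens):
    """Simplified polarity: (positive hits - negative hits) / sqrt(token count)."""
    if not tokens:
        return 0.0
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    return score / len(tokens) ** 0.5

def standardized_polarity(docs):
    """Assumed definition: mean polarity divided by sample standard deviation.

    Returns NaN when the ratio is undefined (fewer than two documents,
    or no variation in polarity).
    """
    scores = [polarity(d) for d in docs]
    if len(scores) < 2:
        return float("nan")
    spread = stdev(scores)
    return mean(scores) / spread if spread > 0 else float("nan")

# Hypothetical tokenized documents.
all_docs = [
    ["love", "bank"],
    ["worst", "fee", "ever"],
    ["thanks", "helpful", "team"],
    ["rude", "hold"],
]
# Hypothetical subset standing in for one complaint-heavy topic.
hot_topic_docs = [all_docs[1], all_docs[3]]

overall = standardized_polarity(all_docs)
hot = standardized_polarity(hot_topic_docs)
print(overall, hot)
```

Scoring a subset of documents with the same function is what allows a single topic to be compared against the corpus as a whole, as the table does for the Hot Topics.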

To explore the relationships between topics further, a plot was constructed to compare correlations between topics. Select topics 3 and 19 to explore their relationship, and click on the plots to see high-probability words and representative (stemmed) samples.
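
As a rough illustration of what a topic correlation measures, the Python sketch below computes Pearson correlations between the columns of a document-topic proportion matrix. Whether the report's plot uses exactly this measure is an assumption, and the matrix values are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def topic_correlations(doc_topic):
    """doc_topic: one row per document, each row a list of topic proportions."""
    k = len(doc_topic[0])
    cols = [[row[j] for row in doc_topic] for j in range(k)]
    return [[pearson(cols[i], cols[j]) for j in range(k)] for i in range(k)]

# Hypothetical proportions for 4 documents over 3 topics (rows sum to 1).
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
]
corr = topic_correlations(doc_topic)
for row in corr:
    print([round(v, 2) for v in row])
```

A positive entry means two topics tend to appear together in the same documents, and a negative entry means one tends to crowd out the other.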

By selecting the classification variable ‘MediaType’ from the Color drop-down, we can see whether a comment came from Twitter or Facebook. Select Color=MediaType on the visualization below and explore the topics.

## Sampling 5000 documents for visualization.